In our Part 1, we analyzed a hotel booking dataset from the INN
Hotels Group to understand the factors influencing booking
cancellations.
hotel_data <- read.csv("../../Dataset/INNHotelsGroup_min.csv")
hotel_datay <- read.csv("../../Dataset/INNHotelsGroup.csv")
str(hotel_datay)
## 'data.frame': 36275 obs. of 19 variables:
## $ Booking_ID : chr "INN00001" "INN00002" "INN00003" "INN00004" ...
## $ no_of_adults : int 2 2 1 2 2 2 2 2 3 2 ...
## $ no_of_children : int 0 0 0 0 0 0 0 0 0 0 ...
## $ no_of_weekend_nights : int 1 2 2 0 1 0 1 1 0 0 ...
## $ no_of_week_nights : int 2 3 1 2 1 2 3 3 4 5 ...
## $ type_of_meal_plan : chr "Meal Plan 1" "Not Selected" "Meal Plan 1" "Meal Plan 1" ...
## $ required_car_parking_space : int 0 0 0 0 0 0 0 0 0 0 ...
## $ room_type_reserved : chr "Room_Type 1" "Room_Type 1" "Room_Type 1" "Room_Type 1" ...
## $ lead_time : int 224 5 1 211 48 346 34 83 121 44 ...
## $ arrival_year : int 2017 2018 2018 2018 2018 2018 2017 2018 2018 2018 ...
## $ arrival_month : int 10 11 2 5 4 9 10 12 7 10 ...
## $ arrival_date : int 2 6 28 20 11 13 15 26 6 18 ...
## $ market_segment_type : chr "Offline" "Online" "Online" "Online" ...
## $ repeated_guest : int 0 0 0 0 0 0 0 0 0 0 ...
## $ no_of_previous_cancellations : int 0 0 0 0 0 0 0 0 0 0 ...
## $ no_of_previous_bookings_not_canceled: int 0 0 0 0 0 0 0 0 0 0 ...
## $ avg_price_per_room : num 65 106.7 60 100 94.5 ...
## $ no_of_special_requests : int 0 1 0 0 0 1 1 1 1 3 ...
## $ booking_status : chr "Not_Canceled" "Not_Canceled" "Canceled" "Canceled" ...
The dataset contains hotel bookings collected from 2017-2018 and is
characterized by a total of 19 columns, comprising 5 categorical and 14
numerical variables.
summary(hotel_data)
## no_of_adults no_of_children no_of_weekend_nights no_of_week_nights
## Min. :0.00 Min. : 0.00 Min. :0.00 Min. : 0.0
## 1st Qu.:2.00 1st Qu.: 0.00 1st Qu.:0.00 1st Qu.: 1.0
## Median :2.00 Median : 0.00 Median :1.00 Median : 2.0
## Mean :1.84 Mean : 0.11 Mean :0.81 Mean : 2.2
## 3rd Qu.:2.00 3rd Qu.: 0.00 3rd Qu.:2.00 3rd Qu.: 3.0
## Max. :4.00 Max. :10.00 Max. :7.00 Max. :17.0
## type_of_meal_plan required_car_parking_space room_type_reserved lead_time
## Length:36275 Min. :0.000 Length:36275 Min. : 0
## Class :character 1st Qu.:0.000 Class :character 1st Qu.: 17
## Mode :character Median :0.000 Mode :character Median : 57
## Mean :0.031 Mean : 85
## 3rd Qu.:0.000 3rd Qu.:126
## Max. :1.000 Max. :443
## arrival_year arrival_month market_segment_type repeated_guest
## Min. :2017 Min. : 1.00 Length:36275 Min. :0.000
## 1st Qu.:2018 1st Qu.: 5.00 Class :character 1st Qu.:0.000
## Median :2018 Median : 8.00 Mode :character Median :0.000
## Mean :2018 Mean : 7.42 Mean :0.026
## 3rd Qu.:2018 3rd Qu.:10.00 3rd Qu.:0.000
## Max. :2018 Max. :12.00 Max. :1.000
## no_of_previous_cancellations no_of_special_requests booking_status
## Min. : 0.00 Min. :0.00 Length:36275
## 1st Qu.: 0.00 1st Qu.:0.00 Class :character
## Median : 0.00 Median :0.00 Mode :character
## Mean : 0.02 Mean :0.62
## 3rd Qu.: 0.00 3rd Qu.:1.00
## Max. :13.00 Max. :5.00
## avg_price_per_room
## Min. : 0
## 1st Qu.: 80
## Median : 99
## Mean :103
## 3rd Qu.:120
## Max. :540
Summary:
The average price per room in the dataset is 103 euros, with a median of
99 euros, but prices can reach up to 540 euros, indicating high-priced
outliers. Some entries even show an average price of zero, possibly
reflecting promotional deals. Guests typically stay for two weekday
nights and one weekend night, with the average number of weekday nights
being 2.2 and weekend nights 0.81. Stays can extend to as many as 17
weekday nights. Most bookings involve two adults, and many guests do not
bring children. Lead times vary significantly, with an average of 85
days, a median of 57, and some bookings made up to 443 days in advance,
suggesting a right-skewed distribution. The data also shows sparse
previous cancellations, with an average of just 0.02, and a maximum of
58. Bookings are spread over 2017 and 2018, peaking in August.
Additionally, most guests do not make special requests, as the median is
zero, although some make up to five requests per booking.
print(ggplot(hotel_data, aes(x = booking_status)) +
geom_bar(fill = "lightblue") +
labs(title = "Booking Status Distribution", x = "Booking Status", y = "Count"))
Summary
The dataset contains 36,275 bookings. Of the total bookings, 24,390
bookings were not canceled (67.2%), while 11,885 bookings were canceled
(32.8%), reflecting a significant cancellation rate that offers rich
insights into factors influencing booking decisions.
print(ggplot(hotel_data, aes(x = factor(no_of_special_requests))) +
geom_bar(fill = "lightblue") +
labs(title = "Number of Special Requests", x = "Number of Special Requests", y = "Count") +
theme_minimal())
Summary
Most guests make no special requests, with a sharp decline as the number
increases. A significant portion makes one request, while two or more
requests are increasingly rare.
contingency_table <- table(hotel_data$market_segment_type, hotel_data$booking_status)
plot_data <- as.data.frame(contingency_table)
colnames(plot_data) <- c("Market_Segment", "Booking_Status", "Count")
print(ggplot(plot_data, aes(x = Market_Segment, y = Count, fill = Booking_Status)) +
geom_bar(stat = "identity", position = "dodge") +
theme_minimal() +
labs(title = "Booking Status by Market Segment",
x = "Market Segment",
y = "Count",
fill = "Booking Status") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)))
Summary
Online market segment have the highest cancellation count, with a large
proportion of canceled bookings of 8475. Offline segment has a
cancellation count of 3253 For other market segment, the cancellation
rates are too less.
ggplot(hotel_data, aes(x = lead_time, fill = booking_status)) +
geom_histogram(binwidth = 10, position = "dodge") +
labs(
title = "Lead Time vs Booking Status",
x = "Lead Time (Days)",
y = "Number of Bookings",
fill = "Booking Status"
) +
theme_minimal()
Summary
Here we can see that the booking with short lead times are less
cancelled and as the lead time increases there are more booking
cancellations. Also, we can see more cancellation happening between lead
time form 100-200.
ggplot(hotel_data, aes(x = booking_status, y = avg_price_per_room, fill = booking_status)) +
geom_boxplot() +
labs(
title = "Average Room Price vs Booking Status",
x = "Booking Status",
y = "Average Room Price"
) +
theme_minimal() +
theme(legend.position = "none")
Summary
The booking with cancelled status has higher median average room price
value compared to non-cancelled booking status.This might mean that the
booking with her average room price have more chances to get
cancelled.
1. Special Requests and Cancellations:
Bookings with special requests had a significantly lower cancellation
rate (20.2%) compared to those without (43.2%). Chi-square test
revealed a statistically significant association between special
requests and booking status.
2. Previous Cancellation History:
Surprisingly, guests with previous cancellations had a much lower
current cancellation rate (4.73%) compared to those without
(33.03%).
This suggests that past cancellation history might not be a
straightforward predictor of future booking behavior.
3. Factors Influencing Cancellations:
3.1 Lead Time:
3.2 Room Price:
4. Seasonal Variations:
The correlation between factors and cancellations varied across
seasons:
4.1 Fall:
Strongest lead time correlation (0.54).
4.2 Summer:
Special requests most negatively correlated with cancellations
(-0.30).
4.3 Spring:
Unique pattern with special requests having a strong negative
correlation (-0.36).
4.4 Winter:
Slightly different dynamics with weaker correlations.
hotel_data$booking_status_binary <- ifelse(hotel_data$booking_status == "Canceled", 1, 0)
remove_zero_variance <- function(df) {
df[, sapply(df, function(col) sd(col, na.rm = TRUE) != 0)]
}
data_filtered <- remove_zero_variance(select_if(hotel_data, is.numeric))
cor_data <- cor(data_filtered, use = "complete.obs")
corrplot(cor_data, method = "color", addCoef.col = "black",
title = "Correlation Matrix Recap", number.cex = 1,
tl.cex = 0.8, mar = c(1, 1, 2, 1))
Lead time strongly predicts cancellations, with longer lead times
increasing likelihood (0.44). More special requests (-0.25) and repeated
guest status (-0.11) reduce cancellations, reflecting customer
commitment and loyalty. Price per room has a weak positive correlation
(0.14) with cancellations. Factors like travel party size and stay
length show minimal impact on cancellation likelihood.
1. Lead time emerged as the most critical factor in
predicting booking cancellations.
2. Special requests significantly reduce the likelihood
of cancellations.
3. Booking behavior varies considerably across
different seasons.
4. Higher-priced rooms show a slight tendency towards
more cancellations.
In this phase, we aim to build a predictive model for hotel booking cancellations and explore key factors driving cancellation behavior. The analysis will address the following critical questions:
Can we predict the likelihood of booking cancellation based on the
lead time?
Can we predict peak booking times and high-demand periods?
What is the relationship between room price and the likelihood of a
booking being canceled?
How do seasonal trends impact cancellation rates and booking
patterns?
Which factors most strongly influence booking cancellation
decisions?
To achieve these objectives, we will analyze customer attributes such as room price, lead time, and seasonal trends. Using logistic and Random forest models, we aim to predict the likelihood of booking cancellations and identify actionable insights. The results will help hotels effectively manage cancellation risks, optimize revenue strategies, and better anticipate high-demand periods.
Given that lead time demonstrated the strongest correlation with booking status, we initially selected lead time as our primary predictor variable and booking status as our target variable to begin our analysis.
set.seed(123)
train_index <- createDataPartition(hotel_data$booking_status_binary, p = 0.7, list = FALSE)
train_data <- hotel_data[train_index, ]
test_data <- hotel_data[-train_index, ]
logistic_model <- glm(
booking_status_binary ~ lead_time,
data = train_data,
family = binomial()
)
summary(logistic_model)
##
## Call:
## glm(formula = booking_status_binary ~ lead_time, family = binomial(),
## data = train_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.82038 0.02328 -78.2 <2e-16 ***
## lead_time 0.01180 0.00019 62.1 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 32083 on 25392 degrees of freedom
## Residual deviance: 27091 on 25391 degrees of freedom
## AIC: 27095
##
## Number of Fisher Scoring iterations: 4
The analysis reveals a significant positive relationship between lead time and booking cancellation probability. As the lead time increases, the log-odds of a booking being canceled also increase, with a coefficient of 0.01180. This suggests that bookings made further in advance are more likely to be canceled. Both the intercept and lead time coefficient are statistically highly significant (p-value < 0.001), indicating that lead time is a reliable and strong predictor of cancellation risk.
test_predictions <- predict(logistic_model, newdata = test_data, type = "response")
predicted_classes <- ifelse(test_predictions > 0.5, 1, 0)
confusion_matrix <- table(Actual = test_data$booking_status_binary,
Predicted = predicted_classes)
print("Confusion Matrix:")
## [1] "Confusion Matrix:"
print(confusion_matrix)
## Predicted
## Actual 0 1
## 0 6720 570
## 1 2102 1490
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
precision <- confusion_matrix[2,2] / sum(confusion_matrix[,2])
recall <- confusion_matrix[2,2] / sum(confusion_matrix[2,])
f1_score <- 2 * (precision * recall) / (precision + recall)
cat("\nModel Performance Metrics:\nAccuracy:", round(accuracy, 4), "\nPrecision:", round(precision, 4), "\nRecall:", round(recall, 4), "\nF1 Score:", round(f1_score, 4), "\n")
##
## Model Performance Metrics:
## Accuracy: 0.754
## Precision: 0.723
## Recall: 0.415
## F1 Score: 0.527
1. Accuracy: 0.754, meaning the model correctly
classifies 75.4% of the bookings.
2. Precision: 0.723, indicating that 72.3% of the
predictions of canceled bookings are correct.
3. Recall: 0.415, meaning the model correctly
identifies 41.5% of the actual canceled bookings.
4. F1-Score: 0.527, which is the harmonic mean of
precision and recall, providing a balanced evaluation of the model’s
performance.
roc_curve <- roc(test_data$booking_status_binary, test_predictions)
plot(roc_curve, main = "ROC Curve for Booking Cancellation Prediction")
The plot indicates that as the specificity (the ability to correctly
identify non-canceled bookings) increases, the sensitivity (the ability
to correctly identify canceled bookings) also increases. This suggests
that the model has good discriminative power in predicting booking
cancellations.
auc_value <- auc(roc_curve)
cat("Area Under the ROC Curve (AUC):", round(auc_value, 4), "\n")
## Area Under the ROC Curve (AUC): 0.75
Area Under the ROC Curve (AUC): 0.75, suggesting the model has good
discriminative power in predicting booking cancellations.
hotel_data$booking_status_binary <- as.numeric(as.character(hotel_data$booking_status_binary))
ggplot(hotel_data, aes(x = lead_time, y = booking_status_binary)) +
geom_smooth(method = "glm", method.args = list(family = "binomial"), se = TRUE) +
labs(
title = "Probability of Booking Cancellation vs Lead Time",
x = "Lead Time (Days)",
y = "Probability of Cancellation"
) +
theme_minimal()
The graph shows the relationship between the lead time (in days) and
the probability of booking cancellation. As the lead time increases, the
probability of booking cancellation rises in a non-linear fashion, with
the curve becoming more steep at higher lead times.
We created a high-demand column by taking the average price per room and identifying bookings in the top 25% of prices. In our random forest model, we used features like lead time, total nights, booking month, market segment, and repeated guest status to predict these high-demand bookings.
prepare_booking_data <- function(hotel_data) {
booking_data <- hotel_data %>%
mutate(
high_demand = as.factor(ifelse(
avg_price_per_room > quantile(avg_price_per_room, 0.75),
1, 0
)),
booking_month = as.factor(arrival_month),
is_weekend = no_of_weekend_nights > 0,
total_nights = no_of_weekend_nights + no_of_week_nights,
is_repeated_guest = repeated_guest > 0
)
return(booking_data)
}
prepared_data <- prepare_booking_data(hotel_data)
set.seed(123)
train_index <- createDataPartition(prepared_data$high_demand, p = 0.7, list = FALSE)
train_data <- prepared_data[train_index, ]
test_data <- prepared_data[-train_index, ]
# Random Forest Model
rf_model <- randomForest(
high_demand ~ lead_time +
total_nights +
booking_month +
market_segment_type +
is_repeated_guest,
data = train_data,
ntree = 500,
importance = TRUE
)
evaluate_model <- function(actual, predicted, model_name) {
conf_matrix <- confusionMatrix(as.factor(predicted), as.factor(actual))
results <- data.frame(
Model = model_name,
Accuracy = conf_matrix$overall['Accuracy'],
Precision = conf_matrix$byClass['Precision'],
Recall = conf_matrix$byClass['Recall'],
F1 = conf_matrix$byClass['F1']
)
return(results)
}
rf_pred <- predict(rf_model, newdata = test_data, type = "prob")[,2]
rf_class <- ifelse(rf_pred > 0.5, 1, 0)
rf_results <- evaluate_model(test_data$high_demand, rf_class, "Random Forest")
print(rf_results)
## Model Accuracy Precision Recall F1
## Accuracy Random Forest 0.81 0.847 0.912 0.878
The model performed impressively, with 81% overall accuracy. It correctly identified 91.2% of actual high-demand bookings (recall) and had 84.7% precision in its high-demand predictions.
rf_importance <- data.frame(
Feature = rownames(importance(rf_model)),
Importance = importance(rf_model)[,1]
) %>% arrange(desc(Importance))
print(head(rf_importance, 10))
## Feature Importance
## booking_month booking_month 100.9
## lead_time lead_time 92.9
## market_segment_type market_segment_type 85.6
## total_nights total_nights 50.4
## is_repeated_guest is_repeated_guest 12.9
ggplot(head(rf_importance, 10), aes(x = reorder(Feature, Importance), y = Importance)) +
geom_bar(stat = "identity") +
coord_flip() +
labs(
title = "Top 10 Features Predicting High-Demand Periods",
x = "Features",
y = "Importance"
) +
theme_minimal()
Feature importance revealed that booking month emerged as the most critical predictor, followed closely by lead time and market segment. This suggests that when a booking is made, its timing and market segment are strong indicators of whether it will be a high-value reservation.
predict_high_demand_periods <- function(model, test_data) {
pred_probs <- predict(model, newdata = test_data, type = "prob")[,2]
high_demand_periods <- test_data[pred_probs > 0.7, ]
print("High-Demand Periods Prediction:")
print(paste("Total High-Demand Periods:", sum(pred_probs > 0.7)))
print("Sample of High-Demand Periods:")
print(head(high_demand_periods))
return(high_demand_periods)
}
high_demand_periods <- predict_high_demand_periods(rf_model, test_data)
## [1] "High-Demand Periods Prediction:"
## [1] "Total High-Demand Periods: 1462"
## [1] "Sample of High-Demand Periods:"
## no_of_adults no_of_children no_of_weekend_nights no_of_week_nights
## 62 1 0 0 3
## 110 2 0 2 1
## 181 2 0 0 3
## 202 2 0 0 2
## 223 2 0 1 5
## 232 2 1 1 3
## type_of_meal_plan required_car_parking_space room_type_reserved lead_time
## 62 Meal Plan 1 0 Room_Type 4 19
## 110 Meal Plan 1 0 Room_Type 1 32
## 181 Meal Plan 1 0 Room_Type 1 11
## 202 Meal Plan 1 0 Room_Type 1 80
## 223 Meal Plan 1 0 Room_Type 1 56
## 232 Meal Plan 1 0 Room_Type 1 77
## arrival_year arrival_month market_segment_type repeated_guest
## 62 2018 5 Online 0
## 110 2017 9 Online 0
## 181 2017 9 Online 0
## 202 2017 7 Online 0
## 223 2018 9 Online 0
## 232 2018 5 Online 0
## no_of_previous_cancellations no_of_special_requests booking_status
## 62 0 2 Not_Canceled
## 110 0 3 Not_Canceled
## 181 0 1 Not_Canceled
## 202 0 1 Canceled
## 223 0 0 Not_Canceled
## 232 0 1 Not_Canceled
## avg_price_per_room booking_status_binary high_demand booking_month
## 62 120.1 0 1 5
## 110 94.5 0 0 9
## 181 80.8 0 0 9
## 202 76.5 1 0 7
## 223 119.0 0 0 9
## 232 135.2 0 1 5
## is_weekend total_nights is_repeated_guest
## 62 FALSE 3 FALSE
## 110 TRUE 3 FALSE
## 181 FALSE 3 FALSE
## 202 FALSE 2 FALSE
## 223 TRUE 6 FALSE
## 232 TRUE 4 FALSE
This function use the random forest model to predict probabilities for high-demand bookings in the test data and it identifies high-demand periods by filtering for predictions with a probability greater than 0.7.
# Fit a logistic regression model
lrmodel1 <- glm(booking_status_binary ~ avg_price_per_room, data = hotel_data, family = binomial)
summary(lrmodel1)
##
## Call:
## glm(formula = booking_status_binary ~ avg_price_per_room, family = binomial,
## data = hotel_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.632519 0.036523 -44.7 <2e-16 ***
## avg_price_per_room 0.008695 0.000326 26.7 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 45887 on 36274 degrees of freedom
## Residual deviance: 45153 on 36273 degrees of freedom
## AIC: 45157
##
## Number of Fisher Scoring iterations: 4
Here’s the interpretation of the logistic regression summary:
The positive coefficient for avg_price_per_room (0.0087, p-value < 2e-16) indicates that as room price increases, the likelihood of a booking being canceled also slightly increases. While statistically significant, the effect size is modest, suggesting that room price alone has a small influence on cancellations.
hotel_data$predicted_prob <- predict(lrmodel1, type = "response")
hotel_data$predicted_class <- ifelse(hotel_data$predicted_prob > 0.5, 1, 0)
# Model evaluation
conf_matrix <- table(hotel_data$booking_status_binary, hotel_data$predicted_class)
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
print(paste("Accuracy:", round(accuracy, 2)))
## [1] "Accuracy: 0.67"
roc_curve <- roc(hotel_data$booking_status_binary, hotel_data$predicted_prob)
plot(roc_curve)
auc_value <- auc(roc_curve)
print(paste("AUC:", round(auc_value, 2)))
## [1] "AUC: 0.6"
The AUC score of 0.6 suggests that the model has limited ability to distinguish between canceled and non-canceled bookings based solely on room price.
# Boxplot for average price per room vs booking cancellation status
boxplot(avg_price_per_room ~ booking_status_binary, data = hotel_data,
main = "Boxplot of Avg Price per Room vs Booking Status",
xlab = "Booking Status", ylab = "Average Price per Room")
Key observations:
Median Price: The median room price for both non-canceled and canceled bookings is quite similar, suggesting that the central tendency of room prices does not vary significantly with booking status.
Distribution Spread: Both groups have a wide spread in room prices, but canceled bookings (status = 1) show slightly more variability in higher price ranges. The interquartile range (IQR) is comparable for both groups, indicating that most bookings (whether canceled or not) occur within a similar price range.
Outliers: There are more outliers in higher room price ranges for canceled bookings, suggesting that cancellations are more frequent among high-price bookings.
The average price per room has a slight association with booking cancellations, particularly at higher price ranges. However, the difference in the overall distribution of prices between canceled and non-canceled bookings is minimal.
hotel_data$season <- case_when(
hotel_data$arrival_month %in% c(3, 4, 5) ~ "Spring",
hotel_data$arrival_month %in% c(6, 7, 8) ~ "Summer",
hotel_data$arrival_month %in% c(9, 10, 11) ~ "Fall",
hotel_data$arrival_month %in% c(12, 1, 2) ~ "Winter",
TRUE ~ "Unknown"
)
# Create subsets for each season
spring_data <- subset(hotel_data, season == "Spring")
summer_data <- subset(hotel_data, season == "Summer")
fall_data <- subset(hotel_data, season == "Fall")
winter_data <- subset(hotel_data, season == "Winter")
# Function to remove columns with zero variance
remove_zero_variance <- function(df) {
df[, sapply(df, function(col) sd(col, na.rm = TRUE) != 0)]
}
# Remove zero variance columns for each season
spring_data_filtered <- remove_zero_variance(select_if(spring_data, is.numeric))
summer_data_filtered <- remove_zero_variance(select_if(summer_data, is.numeric))
fall_data_filtered <- remove_zero_variance(select_if(fall_data, is.numeric))
winter_data_filtered <- remove_zero_variance(select_if(winter_data, is.numeric))
# Calculate and visualize correlation for Spring
cor_spring <- cor(spring_data_filtered, use = "complete.obs")
corrplot(cor_spring, method = "color", addCoef.col = "black",
title = "Correlation Matrix for Spring", number.cex = 1,
tl.cex = 0.8, mar = c(1, 1, 2, 1))
# Calculate and visualize correlation for Summer
cor_summer <- cor(summer_data_filtered, use = "complete.obs")
corrplot(cor_summer, method = "color", addCoef.col = "black",
title = "Correlation Matrix for Summer", number.cex = 1,
tl.cex = 0.8, mar = c(1, 1, 2, 1))
# Calculate and visualize correlation for Fall
cor_fall <- cor(fall_data_filtered, use = "complete.obs")
corrplot(cor_fall, method = "color", addCoef.col = "black",
title = "Correlation Matrix for Fall", number.cex = 1,
tl.cex = 0.8, mar = c(1, 1, 2, 1))
# Calculate and visualize correlation for Winter
cor_winter <- cor(winter_data_filtered, use = "complete.obs")
corrplot(cor_winter, method = "color", addCoef.col = "black",
title = "Correlation Matrix for Winter", number.cex = 1,
tl.cex = 0.8, mar = c(1, 1, 2, 1))
### Logistic Regression: Impact of Seasonality on Booking Status
# Fit logistic regression model
logit_model <- glm(booking_status_binary ~ season, family = binomial, data = hotel_data)
# Summarize the logistic regression model
summary(logit_model)
##
## Call:
## glm(formula = booking_status_binary ~ season, family = binomial,
## data = hotel_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.6965 0.0187 -37.28 <2e-16 ***
## seasonSpring 0.0492 0.0304 1.62 0.11
## seasonSummer 0.3406 0.0276 12.32 <2e-16 ***
## seasonWinter -1.0447 0.0415 -25.18 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 45887 on 36274 degrees of freedom
## Residual deviance: 44616 on 36271 degrees of freedom
## AIC: 44624
##
## Number of Fisher Scoring iterations: 4
# Calculate odds ratios for the season coefficients
odds_ratios <- exp(coef(logit_model))
print(odds_ratios)
## (Intercept) seasonSpring seasonSummer seasonWinter
## 0.498 1.050 1.406 0.352
# Predict probabilities using the logistic regression model
hotel_data$predicted_probs <- predict(logit_model, type = "response")
# Classify predictions using a 0.5 cutoff
hotel_data$predicted_class <- ifelse(hotel_data$predicted_probs >= 0.5, 1, 0)
# Confusion matrix
confusion_matrix <- table(Predicted = hotel_data$predicted_class, Actual = hotel_data$booking_status_binary)
print(confusion_matrix)
## Actual
## Predicted 0 1
## 0 24390 11885
# Calculate accuracy
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
cat("Accuracy:", accuracy, "\n")
## Accuracy: 0.672
# Fit a decision tree with adjusted parameters
tree_model <- rpart(
booking_status_binary ~ season + lead_time + no_of_special_requests,
data = hotel_data,
method = "class",
control = rpart.control(cp = 0.001, minsplit = 10, maxdepth = 5) # Lower cp and minsplit to allow more splits
)
# Plot the updated decision tree
rpart.plot(tree_model, type = 3, extra = 102, fallen.leaves = TRUE,
main = "Decision Tree: Seasonal Trends and Cancellations")
hotel_data$type_of_meal_plan <- as.factor(hotel_data$type_of_meal_plan)
hotel_data$room_type_reserved <- as.factor(hotel_data$room_type_reserved)
hotel_data$market_segment_type <- as.factor(hotel_data$market_segment_type)
hotel_data$arrival_month <- as.factor(hotel_data$arrival_month)
rf_model1 <-randomForest(booking_status_binary ~ no_of_adults + no_of_children + no_of_weekend_nights +
no_of_week_nights + type_of_meal_plan + required_car_parking_space +
room_type_reserved + lead_time + arrival_year + arrival_month +
market_segment_type + repeated_guest +
no_of_previous_cancellations + avg_price_per_room + no_of_special_requests,
data = hotel_data, importance = TRUE, ntree = 100)
feature_importance <- importance(rf_model1)
importance_df <- data.frame(Feature = rownames(feature_importance), Importance = feature_importance[,1])
importance_df <- importance_df[order(-importance_df$Importance),]
importance_df
## Feature Importance
## lead_time lead_time 214.99
## no_of_special_requests no_of_special_requests 183.66
## arrival_month arrival_month 109.45
## avg_price_per_room avg_price_per_room 104.81
## market_segment_type market_segment_type 97.86
## no_of_week_nights no_of_week_nights 52.54
## no_of_weekend_nights no_of_weekend_nights 48.14
## required_car_parking_space required_car_parking_space 42.51
## no_of_adults no_of_adults 33.90
## type_of_meal_plan type_of_meal_plan 32.13
## room_type_reserved room_type_reserved 31.48
## arrival_year arrival_year 25.43
## no_of_children no_of_children 19.48
## repeated_guest repeated_guest 10.73
## no_of_previous_cancellations no_of_previous_cancellations 4.74
ggplot(importance_df, aes(x = reorder(Feature, Importance), y = Importance)) +
geom_bar(stat = "identity", fill = "steelblue") +
coord_flip() + # Flip the axes to make it easier to read
labs(title = "Feature Importance from Random Forest Model",
x = "Feature", y = "Importance") +
theme_minimal()
The feature importance analysis reveals key factors that influence hotel booking cancellations:
Key Drivers:
Lead Time: The most important factor, with longer lead times increasing
the likelihood of cancellations. This suggests that customers are more
likely to cancel bookings made far in advance.
Number of Special Requests: Highly influential, indicating that
customers with fewer special requests are more likely to cancel,
potentially reflecting less commitment to their bookings.
Moderately Influential Factors:
Seasonality and Pricing: Arrival month and average price per room are
significant predictors, showing that cancellations are influenced by
seasonal trends and pricing strategies.
Market Segment: The type of market segment (e.g., corporate, leisure)
also has a considerable impact, reflecting differences in booking
behavior across customer groups.
Secondary Factors:
Stay Duration: The number of weekend and weekday nights has a moderate
impact, suggesting that longer stays might be more stable.
Customer Demographics: Features like the number of adults and children
in a booking contribute to predictions but are less critical.
Least Important Factors:
Repeated Guest Status: Returning guests are less likely to cancel but
have limited overall impact.
Previous Cancellations: Surprisingly, the number of past cancellations
has minimal influence, suggesting that other factors play a more
critical role in predicting cancellations.
# Converting booking status to numeric
hotel_data$booking_status_binary <- as.numeric(hotel_data$booking_status_binary)
hotel_data$market_segment_type_num <- as.numeric(hotel_data$market_segment_type)
hotel_data$arrival_month_num <- as.numeric(hotel_data$arrival_month)
corr_data <- hotel_data[, c("booking_status_binary", "lead_time", "no_of_special_requests",
"avg_price_per_room", "market_segment_type_num", "no_of_week_nights",
"no_of_weekend_nights", "arrival_month_num")]
#correlation matrix
cor_matrix <- cor(corr_data)
# Plotting the correlation matrix using corrplot
corrplot(cor_matrix, method = "circle", type = "lower", tl.col = "black", tl.srt = 45)
Strong Positive Correlation:
Lead Time and Booking Status: Lead time shows a strong positive
correlation with booking cancellations, indicating that longer lead
times are more likely to result in cancellations. This aligns with its
high feature importance.
Strong Negative Correlation:
Number of Special Requests and Booking Status: There is a strong
negative correlation, suggesting that bookings with more special
requests are less likely to be canceled, likely reflecting higher
customer commitment.
Moderate Correlations:
Market Segment and Booking Status: The market segment shows a moderate
positive correlation with cancellations, highlighting the role of
booking channels in cancellation behavior.
Average Price Per Room and Booking Status: There is a weaker but
positive correlation, indicating that higher room prices are slightly
more associated with cancellations.
Weak or Negligible Correlations:
Features such as arrival month show weak correlations with booking
status, suggesting a limited role of seasonality on its own.
The correlation matrix confirms that lead time and the number of special requests are the strongest predictors of booking cancellations. Moderate relationships with market segment and room price also provide actionable insights. However, other features show weak correlations, implying their limited independent impact on cancellations.
# Assuming you've already fitted the logistic regression model
lm_model1 <- lm(booking_status_binary ~ lead_time + no_of_special_requests + avg_price_per_room +
market_segment_type + no_of_week_nights + no_of_weekend_nights + arrival_month,
data = hotel_data)
#summary(lm_model1)
gvif_values <- vif(lm_model1)
print(gvif_values)
## GVIF Df GVIF^(1/(2*Df))
## lead_time 1.25 1 1.12
## no_of_special_requests 1.21 1 1.10
## avg_price_per_room 1.44 1 1.20
## market_segment_type 1.70 4 1.07
## no_of_week_nights 1.08 1 1.04
## no_of_weekend_nights 1.06 1 1.03
## arrival_month 1.36 11 1.01
The Variance Inflation Factor (VIF) analysis indicates that multicollinearity is not a concern in the model, as all predictors have GVIF values below the commonly accepted threshold of 5. This suggests that the predictors are not highly correlated with one another, ensuring the stability and reliability of the model’s coefficient estimates.
We selected Random forest classifier for its ability to handle non-linear relationships and capture feature interactions effectively. Its performance was evaluated using metrics like accuracy, sensitivity, specificity, and balanced accuracy.
hotel_data$booking_status_binary <- as.factor(hotel_data$booking_status_binary)
# Fit Random Forest model for classification (binary target)
rf_model2 <- randomForest(booking_status_binary ~ lead_time + no_of_special_requests + avg_price_per_room +
market_segment_type + no_of_week_nights + no_of_weekend_nights + arrival_month,
data = hotel_data, ntree = 100)
# Print the model summary to ensure it's treated as classification
print(rf_model2)
##
## Call:
## randomForest(formula = booking_status_binary ~ lead_time + no_of_special_requests + avg_price_per_room + market_segment_type + no_of_week_nights + no_of_weekend_nights + arrival_month, data = hotel_data, ntree = 100)
## Type of random forest: classification
## Number of trees: 100
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 10.8%
## Confusion matrix:
## 0 1 class.error
## 0 22868 1522 0.0624
## 1 2386 9499 0.2008
Model Summary
The random forest classification model demonstrated strong performance
in predicting hotel booking cancellations. Using key predictors such as
lead time, number of special requests, average room price, and market
segment type, the model achieved an Out-of-Bag (OOB) error rate of
10.8%, indicating reliable generalization to unseen data. The confusion
matrix highlights the model’s ability to accurately predict
non-cancellations (class 0) with a low class error rate of 6.24%, while
cancellations (class 1) had a slightly higher error rate of 20.08%.
These results underline the effectiveness of random forest in handling non-linear relationships and capturing feature interactions. The insights from the model—such as the importance of lead time and special requests—offer actionable recommendations for managing booking cancellations more effectively. Future work could address the higher misclassification rate for cancellations by exploring class imbalance solutions or incorporating additional features, such as customer demographics or cancellation policies. Overall, the model provides a robust foundation for data-driven decision-making in the hospitality industry.
pred_class_rf <- predict(rf_model2, newdata = hotel_data)
# Confusion Matrix (Make sure both predicted and actual values have the same levels)
conf_matrix_rf <- confusionMatrix(pred_class_rf, hotel_data$booking_status_binary)
conf_matrix_rf
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 23496 1592
## 1 894 10293
##
## Accuracy : 0.931
## 95% CI : (0.929, 0.934)
## No Information Rate : 0.672
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.842
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.963
## Specificity : 0.866
## Pos Pred Value : 0.937
## Neg Pred Value : 0.920
## Prevalence : 0.672
## Detection Rate : 0.648
## Detection Prevalence : 0.692
## Balanced Accuracy : 0.915
##
## 'Positive' Class : 0
##
# Print confusion matrix and the calculated metrics
conf_matrix_df <- as.data.frame(as.table(conf_matrix_rf))
conf_matrix_df
## Prediction Reference Freq
## 1 0 0 23496
## 2 1 0 894
## 3 0 1 1592
## 4 1 1 10293
colnames(conf_matrix_df) <- c("Actual", "Predicted", "Freq") # Rename columns for clarity
# Plot confusion matrix as heatmap using ggplot2
ggplot(conf_matrix_df, aes(x = Predicted, y = Actual, fill = Freq)) +
geom_tile() +
geom_text(aes(label = Freq), color = "black", size = 6) +
scale_fill_gradient(low = "white", high = "blue") +
labs(x = "Predicted", y = "Actual", title = "Confusion Matrix Heatmap") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
The confusion matrix provides detailed insights into the performance of the random forest model for predicting hotel booking cancellations:
True Negatives (TN):
23,496 non-canceled bookings (class 0) were correctly predicted as
non-canceled.
False Positives (FP):
894 non-canceled bookings (class 0) were incorrectly predicted as
canceled (class 1).
False Negatives (FN):
1,592 canceled bookings (class 1) were incorrectly predicted as
non-canceled (class 0).
True Positives (TP):
10,293 canceled bookings (class 1) were correctly predicted as
canceled
# Predict class probabilities for the binary outcome (probabilities for each class, we need the second column for class 1)
pred_probs_rf <- predict(rf_model2, newdata = hotel_data, type = "prob")[, 2]
# Calculate ROC-AUC for Random Forest (use the second column for class 1 probabilities)
roc_curve_rf <- roc(hotel_data$booking_status_binary, pred_probs_rf)
# Plot the ROC curve
plot(roc_curve_rf, main = "ROC Curve for Random Forest Model")
# Print AUC value
auc_value_rf <- auc(roc_curve_rf)
cat("AUC:", round(auc_value_rf, 4), "\n")
## AUC: 0.986
# Extracting Precision, Recall, and Accuracy from the confusion matrix
precision <- conf_matrix_rf$byClass['Pos Pred Value']
recall <- conf_matrix_rf$byClass['Sensitivity']
accuracy <- conf_matrix_rf$overall['Accuracy']
cat("Precision:", round(precision, 4), "\n")
## Precision: 0.936
cat("Recall:", round(recall, 4), "\n")
## Recall: 0.963
cat("Accuracy:", round(accuracy, 4), "\n")
## Accuracy: 0.931
The random forest model demonstrated excellent predictive performance for hotel booking cancellations. With an overall accuracy of 93.1% and a balanced accuracy of 91.5%, the model reliably differentiates between canceled and non-canceled bookings. The Kappa statistic of 0.842 further indicates strong agreement between predictions and actual outcomes, accounting for chance.
Key performance metrics include:
Sensitivity (Non-cancellations): 96.3%, showing the model’s high
ability to correctly identify non-canceled bookings.
Specificity (Cancellations): 86.6%, reflecting good performance in
identifying cancellations.
Positive Predictive Value (Precision): 93.7%, indicating confidence in
predicting non-cancellations.
Negative Predictive Value: 92%, showing reliability in predicting
cancellations.
AUC: 0.986 means the model can correctly differentiate between canceled
(class 1) and non-canceled (class 0) bookings in 98.6% of cases.
# Set up parallel processing (use all but 1 core)
cl <- makeCluster(detectCores() - 1)
registerDoParallel(cl)
# Reduce number of trees for faster cross-validation
ntree_val <- 50
# Perform cross-validation with Random Forest using 5-fold cross-validation
cv_model <- train(booking_status_binary ~ lead_time + no_of_special_requests + avg_price_per_room +
market_segment_type + no_of_week_nights + no_of_weekend_nights + arrival_month,
data = hotel_data,
method = "rf",
trControl = trainControl(method = "cv", number = 5), # 5-fold cross-validation
tuneGrid = data.frame(mtry = 3), # You can adjust mtry for tuning
ntree = ntree_val) # Set ntree to a smaller value for faster computation
# Print the results of cross-validation
print(cv_model)
## Random Forest
##
## 36275 samples
## 7 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 29020, 29020, 29020, 29020, 29020
## Resampling results:
##
## Accuracy Kappa
## 0.865 0.679
##
## Tuning parameter 'mtry' was held constant at a value of 3
# Stop the parallel cluster
stopCluster(cl)
Cross-validation was performed to ensure the reliability and robustness of the random forest model. It helps to evaluate the model’s performance across multiple subsets of the data, reducing the risk of overfitting and providing a more generalizable estimate of accuracy. Specifically, 5-fold cross-validation divides the dataset into five equally sized folds, training the model on four folds and testing it on the remaining fold, iteratively. This ensures that every data point is used for both training and testing, providing a reliable estimate of model performance.
Cross-validation was conducted to validate the model’s predictive performance and reduce overfitting. The accuracy (86.8%) and Kappa (0.686) indicate that the random forest model is robust and provides reliable predictions for booking cancellations and non-cancellations. This approach ensures that the model will generalize well to unseen data.
In this analysis, we successfully developed and evaluated a random forest model to predict hotel booking cancellations based on key features such as lead time, number of special requests, average room price, and market segment type. Through rigorous methods like 5-fold cross-validation, we demonstrated that the model achieves a high accuracy of 86.8%, with a Kappa value of 0.686, indicating substantial agreement between predictions and actual outcomes. The model’s robustness was further validated through performance metrics, including an out-of-bag (OOB) error rate of 10.9%.
Key findings highlight that lead time and the number of special requests are the most influential predictors, providing actionable insights for hotels to optimize revenue management strategies. While the model performs well overall, higher misclassification rates for canceled bookings suggest opportunities for improvement by incorporating additional predictors or addressing class imbalance.
This analysis emphasizes the importance of data-driven decision-making in the hospitality industry, offering valuable tools to anticipate cancellations and enhance operational efficiency. Future work could focus on integrating additional features, such as customer demographics or external factors (e.g., economic trends), to further improve prediction accuracy and model interpretability.
Antonio, N., de Almeida, A., & Nunes, L. (2017). Predicting hotel bookings cancellation with a machine learning classification model. In 2017 IEEE International Conference on Data Mining Workshops (ICDMW) (pp. 1100–1107). IEEE. https://ieeexplore.ieee.org/document/8260781
Antonio, N., de Almeida, A., & Nunes, L. (2017). Predicting hotel booking cancellations to decrease uncertainty and increase revenue.Tourism & Management Studies, 13(2), 25–39. https://www.researchgate.net/publication/310504011_Predicting_Hotel_Booking_Cancellation_to_Decrease_Uncertainty_and_Increase_Revenue
Abeyrathne, C., & Bandara, H. M. N. D. (2023). Hotel booking cancellation prediction system using machine learning. Faculty of Engineering, University of Ruhuna. https://www.researchgate.net/publication/380515766_Hotel_Booking_Cancellation_Prediction_System_using_Machine_Learning